Metabolomics Filtering Analysis

Summary

Metabolomics data were filtered in two steps: blank subtraction, followed by an 80% cumulative signal threshold.

Original Data

Peaks
unique peaks before filtering

After Blank Filter

Peaks
peaks removed

Final (80% Filter)

Peaks
% of original kept

Two-Step Filtering: First, we remove peaks attributable to blank contamination. Then we apply 80% cumulative signal filtering to keep only the most abundant peaks.

Blank Filtering Results

Category | Count | Description
Sample Only |  | Peaks only in samples, not in blanks (auto-kept)
Passed Validation |  | Passed both fold-change (≥3x) AND statistical test (p<0.05, FDR-corrected)
Contamination |  | Failed fold-change or statistical test (removed)
Blank Only |  | Peaks only in blanks (not in samples)
No Signal |  | No detectable signal in any sample or blank

Statistical Validation Details

Criterion | Passed | Failed | Description
Fold-Change ≥3x |  |  | Sample mean ≥ 3× blank mean (pmp/Bioconductor)
Welch's t-test |  |  | p < 0.05 (FDR-corrected, one-sided)
Insufficient Data |  |  | Not enough replicates for t-test (used fold-change only)
Validation Method: Blank subtraction uses dual criteria: (1) fold-change ≥3x for biological significance (pmp/Bioconductor), and (2) a one-sided Welch's t-test with Benjamini-Hochberg FDR correction for statistical significance (adjusted p < 0.05).

Source Data

Key Finding: Root tissue reaches the 80% threshold with far fewer peaks (avg ~ peaks) than leaf tissue (avg ~ peaks). Roots are dominated by a few abundant compounds, while leaf signal is spread across more compounds.

Leaf vs Root

Root samples have more concentrated signal - fewer peaks make up 80% of the total.

Leaf Avg

peaks to reach 80%

Root Avg

peaks to reach 80%

80% Threshold Results - Per Sample

Each row shows one tissue sample and how many peaks were needed to account for 80% of its total signal (after blank filtering). Samples with fewer peaks needed have more concentrated signal.

Leaf Tissue (12 samples)

# | ID | Peaks Needed | Smallest Peak Kept

Root Tissue (11 samples)

# | ID | Peaks Needed | Smallest Peak Kept

How to Read This Table

Example: If a sample needed 500 peaks to reach 80% of its signal, those 500 peaks are the most abundant compounds. The smallest peak kept might contribute 0.02% - anything contributing less was filtered as noise.

Chemical Richness & Diversity

Comparison of metabolite richness (peak counts) and Shannon diversity index across treatments.

Data Source: Blank-filtered dataset ( peaks after 3x blank subtraction).
NOT the 80% cumulative-filtered dataset ( peaks).
Richness and diversity are calculated BEFORE the 80% filter to capture full metabolome complexity.

Chemical Richness (Peak Counts)

Number of detected metabolites per sample (mean ± SE) out of total peaks

Leaf Tissue

Root Tissue

Shannon Diversity Index (H)

H = -Σ(p × ln(p)) where p = relative abundance (Vinaixa et al. 2012)

Leaf Tissue

Root Tissue

Interpretation:
Chemical Richness: Higher values indicate more metabolites detected. Differences may reflect stress responses or metabolic shifts.
Shannon Diversity: Higher H indicates more detected peaks, a more even distribution of abundances among them, or both. Lower H suggests dominance by fewer compounds.

Methodology (Exact Calculations)

Chemical Richness
Data source: df_blank_filtered (after 3x blank subtraction)

For each sample column (e.g., "BL - Drought"):
    richness = COUNT of rows WHERE value > 0.0

For each treatment:
    mean = SUM(sample_richness_values) / n_samples
    SE = STDEV(sample_richness_values) / SQRT(n_samples)

Shannon Diversity Index
Data source: df_blank_filtered (after 3x blank subtraction)

For each sample column:
    1. Get all peak abundances where value > 0.0
    2. total = SUM(all abundances)
    3. For each peak: p = abundance / total
    4. H = -SUM(p × ln(p))

For each treatment:
    mean = SUM(H_values) / n_samples
    SE = STDEV(H_values) / SQRT(n_samples)
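
The two calculations above can be sketched in Python as follows. This is a minimal illustration and not the report's actual code: the function names and toy data are hypothetical, and STDEV is assumed to mean the sample standard deviation (n - 1 denominator).

```python
import math
import statistics

def richness(abundances):
    """Count peaks whose value is above zero."""
    return sum(1 for a in abundances if a > 0.0)

def shannon(abundances):
    """H = -sum(p * ln(p)) over peaks with abundance > 0."""
    positive = [a for a in abundances if a > 0.0]
    total = sum(positive)
    if total == 0:
        return 0.0
    return -sum((a / total) * math.log(a / total) for a in positive)

def mean_se(values):
    """Mean and standard error (sample standard deviation / sqrt(n))."""
    n = len(values)
    mean = sum(values) / n
    # Assumption: STDEV is the sample standard deviation; undefined for n < 2
    se = statistics.stdev(values) / math.sqrt(n) if n > 1 else float("nan")
    return mean, se

# Hypothetical toy data: one list of peak abundances per sample column
samples = {
    "sample_a": [10.0, 10.0, 0.0, 10.0, 10.0],  # four equally abundant peaks
    "sample_b": [40.0, 0.0, 0.0, 0.0, 0.0],     # one dominant peak
}
rich = {name: richness(vals) for name, vals in samples.items()}
H = {name: shannon(vals) for name, vals in samples.items()}
richness_mean, richness_se = mean_se(list(rich.values()))
```

With four equally abundant peaks, H equals ln(4); with a single peak, H is 0, which matches the interpretation that lower H reflects dominance by fewer compounds.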

Why blank-filtered, not 80%-filtered? The 80% cumulative filter removes low-abundance peaks to focus analysis. But richness and diversity metrics should capture the FULL metabolome complexity, so we use the larger blank-filtered dataset.

References:
• Shannon diversity index: Vinaixa et al. 2012, Metabolites
• Chemical richness methods: Fu et al. 2020, Russian Journal of Plant Physiology

Compare Treatments

Toggle treatments to compare. Shows peaks that differ between selected treatments.

Select treatments:

Root Tissue

Leaf Tissue

m/z = mass-to-charge ratio. Molecular formulas assigned via MFAssignR (see Methods tab)
RT = retention time (minutes)
Abundance = normalized peak area (unitless, higher = more compound)

Treatment Overlap

Shows how peaks are distributed across treatment groups (Drought, Ambient, Watered). "Unique" means the peak is ONLY found in that treatment, not in the others.
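
As a rough sketch, the "shared" and "unique" categories are plain set operations on the peaks detected in each treatment. The peak IDs below are made up, and how a peak counts as "found" in a treatment (for example, abundance above zero in at least one sample) is an assumption not stated here.

```python
# Hypothetical sets of peak IDs detected in each treatment group
drought = {"a", "b", "c"}
ambient = {"b", "c", "d"}
watered = {"c", "e"}

all_peaks = drought | ambient | watered

# Peaks present in every treatment
shared_all = drought & ambient & watered

# "Unique" peaks: present in one treatment and absent from the other two
drought_only = drought - ambient - watered
ambient_only = ambient - drought - watered
watered_only = watered - drought - ambient

pct_shared = 100 * len(shared_all) / len(all_peaks)
```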

Leaf Tissue ( peaks)

% of leaf peaks are shared across all treatments

Root Tissue ( peaks)

% of root peaks are shared across all treatments

Cross-Tissue Overlap

How peaks are shared between leaf and root tissues ( unique peaks total).

Region | Peaks | Meaning
Leaf Only |  | Detected in leaf but not root
Root Only |  | Detected in root but not leaf
Both Tissues |  | Detected in both leaf and root
% of peaks are found in both tissues
How to interpret: Peaks unique to a treatment (e.g., "Drought Only") are potential biomarkers for that stress condition. Peaks shared by all three are likely core metabolites present regardless of water availability. Peaks unique to a tissue (leaf-only or root-only) may reflect tissue-specific metabolism.

Two-Step Filtering Process

We use a two-step filtering approach to ensure data quality: first removing contamination, then keeping only the most abundant peaks.

Step 1: Blank Subtraction (Contamination Removal)

Blanks are samples run through the instrument with no plant material - they capture background contamination from solvents, plastics, and the instrument itself.

How it works:
For each peak found in BOTH samples and blanks:
1. Calculate fold-change: sample_mean / blank_mean
2. Perform Welch's t-test (one-sided: sample > blank)
3. Apply Benjamini-Hochberg FDR correction for multiple testing
4. KEEP only if BOTH criteria pass:
   - Fold-change ≥ 3x (biological significance, per pmp/Bioconductor)
   - FDR-adjusted p-value < 0.05 (statistical significance)

Peaks ONLY in samples (not in blanks) → auto-KEEP
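
The decision logic for peaks found in both samples and blanks can be sketched as below. This is an illustration under stated assumptions, not the pipeline's code: the function names and numbers are hypothetical, and the one-sided Welch p-values are taken as already computed (for example with `scipy.stats.ttest_ind(sample, blank, equal_var=False, alternative="greater")`). Peaks found only in samples are assumed to be kept separately and are not part of the correction here.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest, enforcing monotonicity
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

def keep_peaks(sample_means, blank_means, pvalues, min_fold=3.0, alpha=0.05):
    """Keep a peak only if fold-change AND FDR-adjusted p-value both pass.

    Assumes every blank mean is > 0 (the peak is present in blanks).
    """
    adjusted = benjamini_hochberg(pvalues)
    return [
        (s / b) >= min_fold and q < alpha
        for s, b, q in zip(sample_means, blank_means, adjusted)
    ]

# Hypothetical values for four peaks detected in both samples and blanks
sample_means = [300.0, 50.0, 120.0, 100.0]
blank_means = [10.0, 40.0, 30.0, 20.0]
pvalues = [0.001, 0.04, 0.01, 0.2]  # assumed one-sided Welch p-values

decisions = keep_peaks(sample_means, blank_means, pvalues)
```

In this toy example the second peak fails the fold-change criterion (1.25x) and the fourth passes fold-change (5x) but fails after FDR correction, so both are treated as contamination.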

Why dual criteria? The 3x fold-change threshold (pmp/Bioconductor standard) ensures biological relevance (a peak must be meaningfully higher in samples). The statistical test ensures the difference isn't due to random variation. FDR correction accounts for testing thousands of peaks simultaneously.

Statistical Details:
Welch's t-test: Compares sample vs blank means without assuming equal variance
One-sided test: Tests if sample > blank (not just different)
FDR correction: Benjamini-Hochberg method controls false discovery rate at 5%
Leaf blanks:
Root blanks:

Step 2: 80% Cumulative Signal Threshold

After removing contamination, we filter to keep only the most abundant peaks.

How it works (for each sample separately):
1. Take all peaks and their area values for one sample
2. Sort peaks from LARGEST to SMALLEST
3. Add up the areas as you go down the list
4. Stop when you've added up 80% of the total
5. Everything above that line is kept

Important: A compound is kept if it makes the cut in ANY sample. This ensures we don't lose peaks that are important in specific tissues.
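
A minimal sketch of the per-sample threshold and the "kept in ANY sample" rule follows. The function name, peak IDs, and areas are hypothetical; the peak that crosses the 80% line is assumed to be included.

```python
def peaks_to_threshold(areas, fraction=0.80):
    """Return the peak IDs kept for one sample.

    Peaks are added from largest to smallest area until the cumulative
    area reaches `fraction` of the sample total; the crossing peak is kept.
    """
    total = sum(areas.values())
    kept, running = [], 0.0
    if total <= 0:
        return kept
    for peak_id, area in sorted(areas.items(), key=lambda kv: kv[1], reverse=True):
        if running >= fraction * total:
            break
        kept.append(peak_id)
        running += area
    return kept

# Hypothetical peak areas for two samples
samples = {
    "sample_1": {"p1": 60.0, "p2": 25.0, "p3": 10.0, "p4": 5.0},
    "sample_2": {"p1": 5.0, "p2": 5.0, "p3": 85.0, "p4": 5.0},
}

per_sample = {name: peaks_to_threshold(areas) for name, areas in samples.items()}

# A peak survives if it makes the cut in at least one sample
final_peaks = set().union(*per_sample.values())
```

Here sample_1 needs two peaks and sample_2 needs one, and the final set is the union of both, so p3 is retained even though it falls below the line in sample_1.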

Why Two Steps?

Step | Purpose | What it removes
Blank Subtraction | Remove contamination | Plasticizers, solvent impurities, instrument background
80% Threshold | Remove noise | Low-abundance peaks that contribute little to the biological profile
Note on the 80% cutoff: The algorithm keeps adding peaks until the cumulative sum crosses 80%. Two peaks with nearly identical contributions may get different treatment based on where the threshold falls. This is inherent to cumulative thresholds but ensures a consistent reduction in data complexity.

Step 3: Molecular Formula Assignment (MFAssignR)

After filtering, we assign molecular formulas to peaks using the MFAssignR R package, which calculates which chemical formulas could produce each measured mass.

How it works:
For each peak's m/z value:
1. Calculate neutral mass: m/z - 1.007276 (remove proton from [M+H]+)
2. Find all formulas (combinations of C, H, O, N, S, P) that match within 3 ppm
3. Apply chemical rules to filter invalid formulas:
   - H/C ratio between 0.2 and 3.0
   - O/C ratio between 0 and 1.2
   - Nitrogen rule (even/odd mass)
   - Valid double bond equivalents (DBE)
4. Use isotope patterns (13C, 34S) to confirm assignments
5. Select best-matching formula
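
To make steps 1 and 3 concrete, here is a small Python sketch of the neutral-mass calculation and the ratio and DBE checks for a single candidate formula. It does not reproduce MFAssignR's internals: the function names are hypothetical, "valid DBE" is assumed to mean a non-negative integer, phosphorus is assumed trivalent, and the nitrogen rule and isotope checks are omitted.

```python
PROTON = 1.007276  # mass removed from [M+H]+, as in step 1

def neutral_mass(mz):
    """Neutral mass of a peak detected as [M+H]+."""
    return mz - PROTON

def dbe(c, h, n=0, p=0):
    """Double bond equivalents of a neutral formula.

    O and S do not contribute; P is treated as trivalent (assumption).
    """
    return c - h / 2 + (n + p) / 2 + 1

def passes_rules(c, h, o=0, n=0, s=0, p=0):
    """Check the H/C, O/C and DBE rules for one candidate formula."""
    if c == 0:
        return False
    d = dbe(c, h, n, p)
    return (
        0.2 <= h / c <= 3.0
        and 0 <= o / c <= 1.2
        and d >= 0
        and d == int(d)  # assumption: a valid DBE is a non-negative integer
    )

# C26H50O4: H/C = 1.92, O/C = 0.15, DBE = 2, so it passes
ok = passes_rules(c=26, h=50, o=4)
# C2H10O: H/C = 5.0 exceeds the allowed range, so it is rejected
too_saturated = passes_rules(c=2, h=10, o=1)
```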

Parameters Used

Parameter | Value | Meaning
Ion Mode | Positive [M+H]+ | Compounds detected as protonated molecules
Mass Error | 3 ppm | Maximum allowed difference between measured and theoretical mass
Mass Range | 100-1000 Da | Only assign formulas to peaks in this range
Elements | C, H, O, N≤4, S≤2, P≤2 | Allowed elements and maximum counts

Understanding PPM Error

PPM (parts per million) measures how close the measured mass is to the theoretical formula mass:

ppm = (measured - theoretical) / theoretical × 1,000,000

Example: m/z 427.3778 vs C26H50O4+H theoretical 427.3782
         ppm = (427.3778 - 427.3782) / 427.3782 × 1,000,000 = -0.9 ppm
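
The worked example can be reproduced with a few lines of Python. This is an illustrative sketch: the helper names are hypothetical, and the monoisotopic masses are standard values rounded to six decimals.

```python
# Monoisotopic masses of the most abundant isotopes, rounded to 6 decimals
MONOISOTOPIC = {
    "C": 12.000000,
    "H": 1.007825,
    "O": 15.994915,
    "N": 14.003074,
    "S": 31.972071,
    "P": 30.973762,
}
PROTON = 1.007276  # added to the neutral mass to form [M+H]+

def mh_plus_mz(counts):
    """Theoretical m/z of [M+H]+ for a formula given as {element: count}."""
    return sum(MONOISOTOPIC[el] * n for el, n in counts.items()) + PROTON

def ppm_error(measured, theoretical):
    """Signed mass error in parts per million."""
    return (measured - theoretical) / theoretical * 1_000_000

theoretical = mh_plus_mz({"C": 26, "H": 50, "O": 4})  # about 427.3782
error = ppm_error(427.3778, theoretical)              # about -0.9 ppm
within_tolerance = abs(error) <= 3.0                  # 3 ppm limit used in this analysis
```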

A smaller absolute ppm error means higher confidence. Our assignments average ppm, which is excellent.

Formula Classes

Class | Elements | Typical Compounds
CHO | C, H, O only | Sugars, fatty acids, terpenes
CHNO | + Nitrogen | Amino acids, alkaloids
CHNOS | + Sulfur | Sulfur-containing amino acids
CHNOP | + Phosphorus | Phospholipids, nucleotides
Important Limitation: A molecular formula (e.g., C26H50O4) tells you the atoms present, not the structure. Many different compounds (isomers) can share the same formula. Definitive identification requires MS/MS fragmentation data.
Assignment Results: peaks assigned formulas (% of filtered peaks). Mean mass error: ppm.

Filtered Metabolomics Data

Two-step filtered data: blank subtraction (3x threshold, pmp/Bioconductor) followed by 80% cumulative signal threshold.

Original Peaks

before any filtering

After Blank Filter

peaks removed

Final Peaks

% of original kept

What Do These Peak Names Mean?

Each compound is identified by a code like 3.90_564.1489n. This encodes two measurements:

Part | Example | Meaning
First number | 3.90 | Retention time (minutes) - how long it took to pass through the column
Second number | 564.1489 | Mass (m/z) - the molecular weight detected
Suffix | n or m/z | Just notation style
Important: These are NOT identified compounds. We know something with mass 564.1489 eluted at 3.90 minutes, but we don't know what molecule it is yet.
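
If you need the two measurements as numbers, a peak name can be split with a short regular expression. This is a sketch that assumes every name follows the `<RT>_<mass><suffix>` pattern shown above; the function name is hypothetical.

```python
import re

# Retention time, underscore, mass, then the "n" or "m/z" suffix
PEAK_NAME = re.compile(
    r"^(?P<rt>\d+(?:\.\d+)?)_(?P<mass>\d+(?:\.\d+)?)(?P<suffix>n|m/z)$"
)

def parse_peak_name(name):
    """Split a peak name into (retention_time_min, mass, suffix)."""
    match = PEAK_NAME.match(name)
    if match is None:
        raise ValueError(f"Unrecognized peak name: {name!r}")
    return float(match["rt"]), float(match["mass"]), match["suffix"]

rt, mass, suffix = parse_peak_name("3.90_564.1489n")
```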

How to Identify Peaks

  1. Database search - Look up the mass in METLIN, HMDB, or MassBank
  2. Run standards - Buy a pure compound and see if it matches
  3. MS/MS fragmentation - Break it apart and look at the pieces
  4. Literature - Check what others found in similar plants

What You Can Do Without Identification